
SimpleChat Completion Mode flexibility and cleanup, Settings gMe, Optional sliding window #7480

Merged
19 commits merged into ggerganov:master on May 26, 2024

Conversation

hanishkvc
Contributor

@hanishkvc hanishkvc commented May 22, 2024

In Completion mode

  • now, by default, the logic doesn't insert any role-related "THE_ROLE: " prefix into the message when generating the prompt json entry sent to the server's /completions endpoint. This gives the end user full control over whether to insert a prefix, based on the model they are communicating with in completion mode. Given that the /completion endpoint doesn't add any in-between special tokens, tags, or chat templating of its own, an end user can use the multiline input support to assemble any combination of messages and special tokens they see fit, to check how the model responds. Or they can use it just for a normal completion query.

  • avoid the implicit textarea enter-key newline char in the user input; however, if the end user explicitly wants a newline in the sent message, they can always press shift+enter to insert one.

  • fix a corner case where a system prompt (normally neither expected nor needed in completion mode) would mess with the logic. Now it is up to the end user whether to set a system prompt in completion mode; either way, the logic won't raise an exception.

In general

  • update the readme as well as the usage message to capture the flow better.

  • add a placeholder hint wrt the system prompt

@mofosyne, please wait half a day (12 hours) before you merge this; I will cross-check things once more later today. In the meantime, if anyone is trying to use SimpleChat, the commits in this PR could be useful.

@hanishkvc hanishkvc marked this pull request as draft May 22, 2024 23:31
@mofosyne mofosyne added the Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix label May 23, 2024
@hanishkvc
Contributor Author

Consolidate the global control variables into gMe, to allow developers to modify some of the behaviour at runtime by changing gMe members in the browser's devel-tools console.

Add support for an optional (disabled by default) sliding window controlling how much of the chat history is sent to the server/ai-model when chatting.

Add the configuration fields sent to the /chat/completions or /completions endpoints as members of gMe (chatRequestOptions), instead of hard-coding them in the logic's code flow. This allows these fields to be modified, added to, or deleted from the dev console. Also add a max_tokens field to the mix when communicating with the server endpoints.

Initial skeleton, for the future, for trying to fetch the json data as it is being generated, rather than in one shot at the end, if the server supports that (need to look into the web service api in more detail later).

Update the usage messages shown for a new/fresh chat. Also show the current status of the dev-console-controllable configurations.

To keep things simple, avoid mixing chat and completion messages by clearing once, at the time of switching from chat to completion mode.
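
For illustration, tweaking from the devel-tools console would look something like the sketch below (a minimal example assuming the gMe members described above; the temperature line is just a placeholder for any extra request field one might want to experiment with):

```js
// Adjust SimpleChat behaviour at runtime from the browser devel-tools console.
gMe.chatRequestOptions.max_tokens = 512;    // cap how much the model may generate
gMe.chatRequestOptions.temperature = 0.7;   // add or override any request field...
delete gMe.chatRequestOptions.temperature;  // ...or remove it again
```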

@hanishkvc hanishkvc changed the title SimpleChat Completion Mode flexibility and cleanup SimpleChat Completion Mode flexibility and cleanup, Settings gMe, Optional sliding window May 23, 2024
@hanishkvc hanishkvc marked this pull request as ready for review May 23, 2024 23:13
Just have an alert msg in the html about needing javascript enabled, and
have the usage message come from the js file. Update the usage message a bit.
Also enable switching the session as part of the setup_ui call.

Add a possible system prompt as a placeholder for the system-input.

In completion mode (see the sketch below)

* avoid inserting a "Role: " prefix before each role's message

* avoid inserting a newline at the beginning and end of the prompt
  message. However, if there are multiple role messages, insert a
  newline when going from one role's message to the next role's
  message.
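
A minimal sketch of that prompt assembly, assuming the messages are kept as an array of {role, content} entries (illustrative only, not the exact SimpleChat code):

```js
// Build the /completions prompt: no "ROLE: " prefixes, no leading or trailing
// newline, but a newline between consecutive role messages.
function buildCompletionPrompt(messages) {
    return messages.map((m) => m.content).join("\n");
}
```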
Readme update wrt completion mode behavior.

Usage help updated wrt completion mode behavior.

When changing the user-input element from input to textarea, the last
newline at the end of the textarea's user input was not being filtered
out; this is fixed now. However, if the user wants an explicit newline,
they can use shift+enter to insert one, and that won't be removed. The
extra-newline removal logic uses substring and keyup to keep things
simple and avoid some previously noted bugs with other events in the
key path as well as IME composition and the like.
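
A rough sketch of that keyup-based trimming, assuming a textarea element and a hypothetical submit helper (not the actual PR code):

```js
// On a plain Enter the textarea ends up with one extra trailing newline just
// before submit; strip only that one, so Shift+Enter newlines inside the
// message are preserved.
const userInputEl = document.getElementById("user-input"); // assumed element id
userInputEl.addEventListener("keyup", (ev) => {
    if (ev.key === "Enter" && !ev.shiftKey) {
        let msg = userInputEl.value;
        if (msg.endsWith("\n")) {
            msg = msg.substring(0, msg.length - 1);
        }
        handleUserSubmit(msg); // hypothetical helper standing in for the real submit flow
    }
});
```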
The previous logic would clear/reset xchat without doing the same
for iLastSys, leaving it pointing to a now non-existent role-content
entry.

So if a user set a system prompt and used completion mode, this
half-done clear would occur after the model response was received.
In turn, when the user tried to send a new completion query,
handle_user_submit would try to add/update the system prompt, if any,
which would fail, because iLastSys would still be pointing to a
non-existent entry.

This is fixed now by having a proper clear helper in the SC class.
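
A minimal sketch of such a clear helper, assuming the SimpleChat (SC) class keeps the messages in xchat and the latest system-prompt index in iLastSys (member names taken from the description above):

```js
class SimpleChat {
    constructor() {
        this.xchat = [];     // array of {role, content} entries
        this.iLastSys = -1;  // index of the latest system entry, -1 if none
    }
    /** Reset the chat history and the cached system-prompt index together. */
    clear() {
        this.xchat = [];
        this.iLastSys = -1;
    }
}
```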
Previously, any chat history, including the model's response to a
completion query, would get cleared after being shown to the user,
at the end of handle_user_submit, rather than at the beginning.

This gave the flexibility that a user could switch from chat mode
to completion mode and have the chat history up to that point sent
to the ai model as part of the completion query. However, this flow
also had the issue that if the user switched between different chat
sessions after getting a completion response, they could no longer
see the completion query and the response they had just got.

The new flow moves the clearing of chat history in completion mode
to the beginning of handle_user_submit, so that the user doesn't
lose the last completion-mode query and response until a new
completion-mode query is sent to the model, even if they switch
between chat sessions. At the same time, the loss of flexibility
around implicitly converting previous chat history into part of
the completion query doesn't matter, because the end user can now
enter multiline queries.
For later

The server flow doesn't seem to be sending back data early, at least
for the request (including options) that is currently sent.

If it becomes possible to read the json data early, as and when the
ai model is generating it, then this helper needs to indirectly update
the chat div with the received data, without waiting for the overall
data to be available.
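
For reference, incrementally reading a fetch response would look roughly like the sketch below (a generic Streams API pattern, not the actual skeleton in the PR; whether the server sends partial data for the current request options is a separate question, as noted above):

```js
// Feed each decoded chunk to a callback (e.g. one that appends to the chat div)
// as it arrives, instead of waiting for the full body.
async function readResponseIncrementally(resp, onChunk) {
    const reader = resp.body.getReader();
    const decoder = new TextDecoder();
    while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        onChunk(decoder.decode(value, { stream: true }));
    }
}
```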
Keep the title simple so that the print file name doesn't have chars
that need to be removed.

Update readme wrt some of the new helpers and options.

Change the Usage list to a list of lists, add a few items and style it
to reduce the margin around the lists.

As sometimes, based on the query from the user, the ai model may get
into a runaway kind of generation with repetitions and the like, add
max_tokens to try to limit this runaway behaviour, if possible.

This allows the end user to see the settings used by the logic, as
well as change/update those settings if they want to, using the
devel-tools/console.

This is disabled by default. However, if enabled, then in addition to
the latest system message, only the last N user messages after the
latest system message, and their responses from the ai model, will be
sent to the ai-model when querying for a new response.

This specified N also includes the latest user query.
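
A simplified sketch of that sliding-window selection (illustrative only; the member and function names here are assumptions, not the real SimpleChat identifiers):

```js
// Keep the latest system message plus the last nUserMsgs user messages
// (and the assistant responses that follow them); nUserMsgs includes the
// latest user query. A negative nUserMsgs means "send everything".
function applySlidingWindow(xchat, iLastSys, nUserMsgs) {
    if (nUserMsgs < 0) return xchat.slice();
    const head = iLastSys >= 0 ? [xchat[iLastSys]] : [];
    const tail = xchat.slice(iLastSys + 1);
    let usersSeen = 0;
    let start = tail.length;
    for (let i = tail.length - 1; i >= 0; i--) {
        if (tail[i].role === "user") {
            usersSeen += 1;
            start = i;
            if (usersSeen >= nUserMsgs) break;
        }
    }
    return head.concat(tail.slice(start));
}
```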
@hanishkvc hanishkvc force-pushed the hkvc_webfrontend_simplechat_v2 branch from 45eb798 to 11d2d31 Compare May 23, 2024 23:28
@hanishkvc
Contributor Author

hanishkvc commented May 23, 2024

Hi @mofosyne,

This set of updates/commits is ready for merging with main/master.

[Look at next comment]

@hanishkvc
Contributor Author

hanishkvc commented May 24, 2024

Hi @ggerganov @mofosyne @ngxson @teleprint-me / all

Based on some testing across 7b-and-below models (Llama3, Mistral v02, Phi3, Phi2) on an 8GB laptop, what I notice is that

  • [A] confining the context sent to the server/model to system + last-query-response + current-query seems to generally lead to decent enough responses from the model, with an additional max_tokens=512 limiting any runaway garbage/repeat text generation. This also means that runaway garbage text from past chat generally doesn't affect anything beyond the immediately next question.

  • [B] sending the full previous chat history to the server/ai-model may lead to slower responses, as well as more chances of runaway garbage/repeat text at the end for some models and queries. Setting max_tokens limits the runaway garbage to the specified limit (wrt the /chat/completions endpoint), but it may also more often cause the answer generation to be chopped/partial in subsequent questions.

So do you want the default context window sent from the client side to the server to be limited to system + last-query-response + cur-query, or do you want the client to always blindly send the full chat history of a given chat session?

For a developer, keeping the client-side context-window-limiting logic set to unlimited may be useful for testing server/ai-model implications, while for simple/normal end-user usage, keeping the client-side context window limited to system + last-query-response + cur-query may make some sense.

Users also have the option of using NewCHAT to start a new chat session.

NOTE: The implemented client-side context-window-limiting logic is flexible: one can control the amount of chat history sent in terms of the latest system prompt update in a given chat session, and how much of the latest user queries + assistant responses after that should be included, or whether the full history should be included. But the two possibilities I mentioned above illustrate the issue and a possible solution in a simple and straight way.

NOTE: @ngxson the server currently doesn't seem to honor max_tokens wrt the /completions endpoint, while the /chat/completions endpoint does seem to honor it. This angle could also help fine-tune the CI test failure which was seen wrt server testing, based on a quick glance at what I saw when it was noticed on the original simplechat PR.

@ngxson
Collaborator

ngxson commented May 24, 2024

> the server currently doesn't seem to honor max_tokens wrt the /completions endpoint, while the /chat/completions endpoint does seem to honor it. This angle could also help fine-tune the CI test failure which was seen wrt server testing, based on a quick glance at what I saw when it was noticed on the original simplechat PR.

In the /chat/completions endpoint, we use the field name max_tokens because it provides compatibility with the OAI API.

The plain /completions endpoint is not OAI-compatible; the field name there is n_predict.
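
In other words (illustrative request bodies, trimmed down to just the length-limit fields):

```js
// OAI-compatible chat endpoint: the generation length is capped via max_tokens.
const chatBody = { messages: [/* chat messages */], max_tokens: 1024 }; // POST /chat/completions

// Plain llama.cpp completion endpoint: the equivalent field is n_predict.
const completionBody = { prompt: "some prompt text", n_predict: 1024 }; // POST /completions
```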

@ngxson
Collaborator

ngxson commented May 24, 2024

> sending the full previous chat history to the server/ai-model may lead to slower responses, as well as more chances of runaway garbage/repeat text at the end for some models and queries.

Regarding this point, maybe you can enable cache_prompt: true for each request. This allows you to re-use the KV cache, so when you add a new message to the chat history, it won't need to re-evaluate all the old messages.

However, this approach only works when you add more messages to the history. It won't work with a sliding window, because we remove earlier messages from the history. Please have a look at #5793 for more details.
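
For example, a sketch of a request with prompt caching enabled (an illustrative body against the plain /completions endpoint; the base URL and wrapper function are assumptions):

```js
// Ask the server to keep/reuse the prompt's KV cache, so the unchanged leading
// part of the prompt does not need to be re-evaluated on the next request.
async function queryWithPromptCache(serverBaseUrl, prompt) {
    const resp = await fetch(`${serverBaseUrl}/completions`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
            prompt,
            n_predict: 1024,
            cache_prompt: true,
        }),
    });
    return resp.json();
}
```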

@hanishkvc
Contributor Author

hanishkvc commented May 24, 2024

Hi @ngxson @ggerganov

I will check out the behaviour with cache_prompt (with and without max_tokens); at the same time, I am not fully sure this will solve the core issue, which is to an extent caused by the model's context window.

#5793, I assume you are mentioning wrt the future, as it seems to be paused for now? Also, from a casual look at things mentioned in it, the current caching (as well as the suggestion discussed there) is mainly useful for the common case of having the core user-entered context in the system prompt, and/or a system prompt + first few fixed handshake messages (which can indirectly be taken as a larger system prompt), followed by a changing tail in the query/chat sent. Either way, that also seems to favour a large system prompt along with a client-side sliding window mechanism.

> sending the full previous chat history to the server/ai-model may lead to slower responses, as well as more chances of runaway garbage/repeat text at the end for some models and queries.
>
> Regarding this point, maybe you can enable cache_prompt: true for each request. This allows you to re-use the KV cache, so when you add a new message to the chat history, it won't need to re-evaluate all the old messages.
>
> However, this approach only works when you add more messages to the history. It won't work with a sliding window, because we remove earlier messages from the history. Please have a look at #5793 for more details.

Reduce the chat history context sent to the server/ai-model to be
just the system-prompt, prev-user-request-and-ai-response and
cur-user-request, instead of the previous full chat history.
This way, if there is any response with garbage/repetition, it
doesn't mess with things beyond the next question, in some ways.

Increase max_tokens to 1024, so that a relatively large previous
response doesn't eat up the space available for the next
query-response. However, don't forget that the server should also
be started with a model context size of 1k or more, to be on the
safe side.

Add frequency and presence penalty fields set to 1.2 to the set
of fields sent to the server along with the user query, so that
the model is partly set up to try to avoid repeating text in its
response.

The /completions endpoint of examples/server doesn't take max_tokens;
instead it takes the internal n_predict. For now, add the same on
the client side; maybe later add max_tokens to the /completions
endpoint handling.
@hanishkvc
Contributor Author

hanishkvc commented May 24, 2024

Hi @ngxson @ggerganov

Did test with cache_prompt; it didn't change things much. Also, I didn't specifically check for any speed impact; rather, the garbage phase will eat up the time. As before, max_tokens ensures that garbage text repetition doesn't run away.

Additionally, the presence/frequency penalty also didn't help in avoiding garbage text repetition for the queries which generated it.

Tested the above two aspects for now with Llama3 8b.

As a sum total of all this, from an overall perspective, for now I have set up the defaults in SimpleChat, i.e. on the client side, such that

  1. by default it sends only system-prompt + immediately-previous-user-request-and-ai-response + cur-request (i.e. chat mode)
    • even the chat displayed will update accordingly.
  2. max_tokens is set to 1024 (n_predict is also added, set to the same)
    • the server by default may set the model ctxt size to 512 tokens or so; one may want to modify that as well when running it
  3. frequency/presence penalty is set to 1.2

The developer/end-user is shown this info whenever a new chat is started. And they can modify these if required, by changing gMe from the browser's devel-tools/console, for now (see the sketch below).
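
Roughly, a sketch of how those defaults might look on gMe (chatRequestOptions is the member named earlier in this PR; the sliding-window member name here is an assumption, not the real identifier):

```js
gMe.chatRequestOptions = {
    max_tokens: 1024,          // honoured by /chat/completions
    n_predict: 1024,           // equivalent field for /completions
    frequency_penalty: 1.2,
    presence_penalty: 1.2,
};
gMe.slidingWindowUserMsgCount = 2; // assumed name: system + prev query/response + current query
```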

@teleprint-me
Contributor

teleprint-me commented May 24, 2024

@hanishkvc The context can be dynamically managed. Another possibility is to make a dynamic upper limit defined by the user. Artificially limiting the context seems undesirable.

@hanishkvc
Contributor Author

hanishkvc commented May 24, 2024

Hi @teleprint-me @ggerganov @ngxson

Currently llama.cpp/server implicitly limits the model context size. What I have seen is that by default it is 512 tokens, i.e. if one runs the server without a -c CONTEXT_SIZE argument. And either way, independent of llama.cpp, most of the current models have relatively small implicit context-size limits. Things are moving towards larger implicit ctxt sizes or sliding ctxt windows on the model side, but that is not fully evolved nor widespread yet.

In turn, what I have seen is that when the [system-prompt + chat-history + current-query] plus the response generated for the current query together exceed this context size, it usually leads to repeating garbage text, which seems to stop implicitly once everything, including the garbage text, reaches around 8K tokens (some limit potentially enforced by llama.cpp or so; I haven't looked into it), which in turn takes quite a lot of time on laptops without a GPU. Using the max_tokens hint from the client side forces the server to stop once that many tokens have been generated by the ai-model, irrespective of whether it indicates the end of generation or not.

Given that the client has no control over the model's context size, nor over the question/query the end user will ask, which may lead to a long answer that in turn can exceed the context size, that is the reason I am forcing the max_tokens limit. It is also the reason I implemented the client-side sliding-window concept, to try to ensure that the context loaded by system + history + query doesn't occupy too much of the context size on average.

Currently I haven't added a UI specifically for setting it and the other config parameters. However, I expose most of the logic/behaviour-controlling parameters through a global variable gMe, so that one can modify them from the browser's devel-tools/console.

Maybe I will add a UI to control these settings. I originally started on SimpleChat more to test some logic wrt llama.cpp and in turn the server. So the idea has been to keep the code basic and minimal, so that anyone wanting to test llama.cpp and the server doesn't have to dig through the code too much to change things.

Also, I haven't looked into the llama.cpp and server code flow in detail yet, only specific parts when I faced some issue, so I am not fully sure of the llama and server flow. At the same time, I assume that to accommodate both fixed-context and sliding-context models in a straight and simple way, llama.cpp and the server currently don't implicitly stop when the context size is reached, but instead depend on max_tokens/n_predict for when to force a stop.

> @hanishkvc The context can be dynamically managed. Another possibility is to make a dynamic upper limit defined by the user. Artificially limiting the context seems undesirable.

Collaborator

@mofosyne mofosyne left a comment


Just did a quick code review. Looks alright, but will wait for one other reviewer for a bit before merging.

Regarding the targeted audience: I would think ordinary people would prefer the standard interface. As you said, this is more targeted towards developers, so the feature set should be lean and mean and focus on the explainability of the code itself. If anything, you may want to emphasise that point in the code, so developers don't try to feature-creep it.

@hanishkvc
Contributor Author

> Just did a quick code review. Looks alright, but will wait for one other reviewer for a bit before merging.
>
> Regarding the targeted audience: I would think ordinary people would prefer the standard interface. As you said, this is more targeted towards developers, so the feature set should be lean and mean and focus on the explainability of the code itself. If anything, you may want to emphasise that point in the code, so developers don't try to feature-creep it.

The idea is to be easy enough to use for basic purposes, while also being simple and easily discernible by developers who may not come from a web-frontend background (and so may not be familiar with template/end-use-specific-language-extension driven flows), so that they can use it to explore/experiment with things.

Because there is some amount of experimentation currently going on with the default interface exposed by the server, chances are that, for now, this may be more approachable for ordinary users as well.

And given that the idea is also to help developers explore/experiment, some flexibility is provided to change behaviour easily using the devel-tools/console for now. And skeletal logic has been implemented to explore some of the endpoints and the ideas/implications around them.

@mofosyne mofosyne added the merge ready indicates that this may be ready to merge soon and is just holding out in case of objections label May 25, 2024
@mofosyne
Collaborator

Sounds reasonable enough. Let's see if we get any positive assessment or objection, otherwise I'll merge it in a few days or so.

@hanishkvc
Contributor Author

hanishkvc commented May 25, 2024

Hey @mofosyne,

Please do go ahead and merge this in; this is a logical bunch of straightforward cleanup and improvements to the initial SimpleChat.

If possible, I want to start a new PR on an optional, simple histogram-based automatic garbage identification and trimming logic, and in turn dial down the frequency/presence penalty, because those penalties just make the garbage text repetition more difficult to identify: in some of the models like llama3, it seems to over-intelligently repeat the text with small variations between the repetitions, making it difficult to auto-identify the garbage.

And also maybe a simple setup UI.

But I want to branch it from the merged master here, rather than from this PR branch, and finish it tomorrow or so, so that I can get back to other things.

@mofosyne
Collaborator

mofosyne commented May 26, 2024

@hanishkvc okay, had a second look over. Considering the isolated changes to an isolated subsystem, I'm happy to merge this in so you can proceed with your next idea.

But just gonna double check: do you really want to add these extra features, or is this feature creep? What is your justification for adding these to this 'simple' UI, rather than placing them in the main tool, for instance? You need to balance the code's objective of being readable against the code acting as a developer tool. If needed and you can justify it, should we make a 'dev tool' UI folder and keep this as a stripped-down demo?

@mofosyne mofosyne merged commit b9adcbb into ggerganov:master May 26, 2024
22 checks passed
@hanishkvc
Contributor Author

hanishkvc commented May 26, 2024

@mofosyne I agree in principle,

My idea is to keep the settings-related code confined to the Me class (which maintains the global configurable parameters and shows them) or an equivalent new class/file, and the dumb garbage trimming in its own helper class/file, with minimal change to the existing code flow and a bit more cleanup/structuring, i.e. moving some existing helpers into their own files, like a ui file or so.

So the end goal is still to keep it simple, and potentially even more structured, while also hopefully making it easier to use some of its flexibility.

@hanishkvc
Contributor Author

PR #7548 extends SimpleChat with a simple-minded garbage-trimming logic and a minimal settings UI.

@khimaros
Contributor

@hanishkvc I suspect your new UI may be running into the same caching issues that affect other third-party clients, as described in #7185.

@hanishkvc
Contributor Author

hanishkvc commented Jun 14, 2024

Hi @khimaros

I had looked at prompt_cache some time back. However, because I use a relatively fixed system prompt plus a sliding window into the user chat flow, the current prompt caching, which expects a contiguous history (for its own valid reasons), won't help much from a cached-data re-use perspective.

But at the same time, technically there shouldn't be any harm in keeping prompt caching enabled, so that for now the system-prompt part (and in future maybe also the sliding user-chat part, when the caching logic acquires that capability) can get reused in an appropriate way. I will send a patch for SimpleChat keeping it enabled, as well as adding a delete button for when there is chat history available from a previous session that one may want to remove, tomorrow.
